Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms
Abstract
Empirical research in learning algorithms for classification tasks generally requires the use of significance tests. The quality of a test is typically judged on Type I error (how often the test indicates a difference when it should not) and Type II error (how often it indicates no difference when it should). In this paper we argue that the replicability of a test is also of importance. We say that a test has low replicability if its outcome strongly depends on the particular random partitioning of the data that is used to perform it. We present empirical measures of replicability and use them to compare the performance of several popular tests in a realistic setting involving standard learning algorithms and benchmark datasets. Based on our results we give recommendations on which test to use.
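One way to make the notion of replicability concrete is to repeat the same test on several random partitionings of one dataset and ask how often two runs reach the same verdict. The sketch below illustrates that idea only; the function name `pairwise_agreement` and the choice of pairwise agreement as the score are assumptions made for this example, not necessarily the measures defined in the paper.

```python
from math import comb


def pairwise_agreement(outcomes):
    """Fraction of pairs of runs whose test outcomes agree.

    `outcomes` holds one boolean per run: True if the test reported a
    significant difference on that random partitioning, False otherwise.
    A value of 1.0 means every run gave the same verdict.
    (Hypothetical helper, written for illustration.)
    """
    n = len(outcomes)
    if n < 2:
        raise ValueError("need at least two runs to compare")
    rejects = sum(1 for o in outcomes if o)
    accepts = n - rejects
    # Pairs that agree are either both "reject" or both "accept".
    agreeing_pairs = comb(rejects, 2) + comb(accepts, 2)
    return agreeing_pairs / comb(n, 2)


if __name__ == "__main__":
    # Ten runs of the same test on different random partitionings:
    # seven report a difference, three do not.
    runs = [True] * 7 + [False] * 3
    print(round(pairwise_agreement(runs), 3))
```

A test whose verdict flips between partitionings scores well below 1.0 here even when its Type I and Type II error rates look acceptable, which is the situation the abstract describes as low replicability.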
Related resources
Evaluating machine learning methods and satellite images to estimate combined climatic indices
The reflections recorded on satellite images have been affected by various environmental factors. In these images, some of these factors are combined with other environmental factors that cannot be distinguished. Therefore, it seems wise to model these environmental phenomena in the form of hybrid indicators. In this regard, satellite imagery and machine learning methods can play a unique role ...
Choosing Learning Algorithms Using Sign Tests with High Replicability
An important task in machine learning is determining which learning algorithm works best for a given data set. When the amount of data is small the same data needs to be used repeatedly in order to get a reasonable estimate of the accuracy of the learning algorithms. This results in violations of assumptions on which standard tests are based and makes it hard to design a good test. In this arti...
Choosing Between Two Learning Algorithms Based on Calibrated Tests
Designing a hypothesis test to determine the best of two machine learning algorithms with only a small data set available is not a simple task. Many popular tests suffer from low power (5x2 cv [2]), or high Type I error (Weka’s 10x10 cross validation [11]). Furthermore, many tests show a low level of replicability, so that tests performed by different scientists with the same pair of algorithms...
Using Machine Learning Algorithms for Automatic Cyber Bullying Detection in Arabic Social Media
Social media allows people to interact and express their thoughts or feelings about different subjects. However, some users may write offensive tweets to others via social media, which is known as cyber bullying. Successful prevention depends on automatically detecting malicious messages. Automatic detection of bullying in the text of social media by analyzing the text "tweets" via one of the machine l...
Optimizing the AGC system of a three-unequal-area hydrothermal system based on evolutionary algorithms
This paper focuses on expanding and evaluating an automatic generation control (AGC) system of a hydrothermal system by modelling the appropriate generation rate constraints to operate practically in an economic manner. The hydro area is considered with an electric governor and the thermal area is modelled with a reheat turbine. Furthermore, the integral controllers and electri...